A Plagiarism Detection Approach Based on SVM for Persian Texts
Authors
Abstract
Plagiarism is the unauthorized use or adaptation of others’ works and ideas without citing them. Numerous methods have been proposed to detect plagiarism in different languages; however, little work has addressed Persian. This study uses statistical and semantic features to evaluate how well Support Vector Machines (SVMs) detect plagiarism in Persian texts. To increase accuracy, a stemmer was designed for Persian words. Both feature types were used to train and apply the SVM. The statistical features are the Jaccard coefficient, Dice coefficient, Levenshtein distance, and Longest Common Subsequence. To detect semantic similarities, a new method called “Index Words Replacement” is proposed. The proposed framework was tested on the PAN data set. The results show a precision of 0.93337, a recall of 0.70124, and a Plagdet score of 0.80083.
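The four statistical features named in the abstract can be illustrated with a short sketch. The abstract does not say whether the measures are computed over words or characters, nor how they are normalized, so the word-level tokenization, the length normalization, and all function names below (`jaccard`, `dice`, `levenshtein`, `lcs_length`, `feature_vector`) are assumptions made for illustration only, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over the sets of items."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty inputs count as identical
    return len(sa & sb) / len(sa | sb)


def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) over the sets of items."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty inputs count as identical
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def feature_vector(suspicious, source):
    """Hypothetical feature vector for one sentence pair.

    Assumes whitespace tokenization (stemming would be applied first in
    the described system) and normalizes the two sequence measures by
    the longer token count so every feature lies in [0, 1].
    """
    a, b = suspicious.split(), source.split()
    longest = max(len(a), len(b)) or 1
    return [
        jaccard(a, b),
        dice(a, b),
        levenshtein(a, b) / longest,
        lcs_length(a, b) / longest,
    ]
```

Vectors of this kind, one per candidate sentence pair and labeled as plagiarized or not, are the sort of input an SVM classifier could be trained on. The semantic "Index Words Replacement" feature is not sketched here because the abstract does not describe how it works.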
Similar resources
Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield reliable results, since they ignore the special features of each language. Considering the paucit...
English-Persian Plagiarism Detection based on a Semantic Approach
Plagiarism, which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism, which is sometimes referred to as cross-li...
A Text Alignment Algorithm Based on Prediction of Obfuscation Types Using SVM Neural Network
In this paper, we describe our text alignment algorithm, which achieved the first rank in the Persian Plagdet 2016 competition. The Persian Plagdet corpus includes several obfuscation strategies. Information about the type of obfuscation helps plagiarism detection systems use the most suitable algorithm for each type. For this purpose, we use SVM neural network for classification of documents ac...
Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems
In this paper we introduce Mahak Samim, a plagiarism detection corpus that consists of Persian academic texts in which plagiarism cases are embedded. This corpus, which can be used for evaluating plagiarism detection systems, consists of more than five thousand artificial plagiarism cases with various lengths and diverse degrees of obfuscation. The development process and the features of the co...
External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...